Overview

There is mainly 3 parts to this story:

  1. A simple easy to communicate model of the key relationship

  2. A medium complexity smoothing analysis

  3. A full power time series analysis with causal inference

The two data set used in this analysis are the Madison case and waste water concentration data.

##           Date    Site Cases    N1 N1Error
## 435 2021-06-19 Madison     3    NA      NA
## 436 2021-06-20 Madison     8 49443    8812
## 437 2021-06-21 Madison     1 90447   20429
## 438 2021-06-22 Madison     3 44587   12427
## 439 2021-06-23 Madison     4 14469    9155
## 440 2021-06-24 Madison    NA 28023    7724

A simple display of the data shows the core components of this story. First that both data sets are extremely noisy. And that there is a hint of a relationship between the two signals.

MinMaxNorm <- function(Vec){#normalizes the data to range from 0 and 1
  normVec <- (Vec-min(Vec,na.rm=TRUE))/max(Vec,na.rm=TRUE)
  return(normVec)
}

NoNa <- function(DF,...){#Removes NA from the two reverent columns
  ColumnNames <- c(...)
  print(ColumnNames)
  NoNaDF <- DF%>%
    filter(
      across(
      .cols = ColumnNames,
      .fns = ~ !is.na(.x))
      )
  return(NoNaDF)
}

FirstImpression <- FullDF%>%
  NoNa("N1","Cases")%>%#Removing outliers
  ggplot(aes(x=Date))+#Data depends on time
  geom_line(aes(y=MinMaxNorm(N1), color="N1",info=N1))+#compares N1 to Cases
  geom_line(aes(y=MinMaxNorm(Cases), color="Cases",info=Cases))+
  labs(y="variable min max normalized")
## [1] "N1"    "Cases"
ggplotly(FirstImpression,tooltip=c("info","Date"))

From a first pass it is clear that the waste water measurements before 11/20/2020 did not function as an effective measure of the amount of waste water shed in the community. So for this analysis we are removing waste water data from before that point. Also there are some extreme outliers that we remove for more effective analysis

## [1] "N1"    "Cases"

A key component to this relationship is that the relationship between N1 and Case involves a gamma distribution modeling both the time between catching Covid-19 and getting a test and the concentration of the shedded particles. We found a gamma distribution with mean 11.73 days and a sd of 7.68 match’s other research and gives good results.

scale  <- 5.028338
shape  <- 2.332779 #These parameters are equivilent to the mean and sd above

weights <- dgamma(1:21, scale = scale, shape = shape)
plot(weights,  
            main=paste("Gamma Distribution with mean = 11.73 days,and SD = 7.68"), 
            ylab = "Weight", 
            xlab = "Lag")

SLDSmoothedDF <- FullDF3%>%
  mutate(SLDCases = c(rep(NA,20),
                        rollapply(Cases,width=21,FUN=weighted.mean,
                                  w=weights,
                                  na.rm = TRUE)))


SLDSmoothedDF%>%
  NoNa("N1","SLDCases")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=MinMaxNorm(SLDCases), color="SLDCases"))+
  geom_line(aes(y=MinMaxNorm(N1), color="N1"))+
  facet_wrap(~Site)+
  labs(y="variable min max normalized")
## [1] "N1"       "SLDCases"

To isolate this relationship we used a primitive binning relationship. This clarifies the relationship we see hints of in the previous graphic and masks the noise in the data.

medianMean <- function(Vec){
  return(mean(replace(Vec, c(which.min(Vec), which.max(Vec)), NA), na.rm = TRUE))
}

StartDate <- 1
DaySmoothing <- 14
Lag <- 4

BinDF <- SLDSmoothedDF%>%
  select(Date, SLDCases, N1)%>%
    mutate(MovedCases = data.table::shift(SLDCases, Lag), 
           Week=(as.numeric(Date)+StartDate)%/%DaySmoothing)%>%
  group_by(Week)%>%
  #filter(Week>2670)%>%
  summarise(BinnedCases=mean(MovedCases, na.rm=TRUE), BinnedN1=exp(mean(log(N1), na.rm=TRUE)))



BinDF%>%
  ggplot()+
  geom_line(aes(x=Week, y=MinMaxNorm(BinnedN1), color="N1"))+
  geom_line(aes(x=Week, y=MinMaxNorm(BinnedCases), color="Cases"))+
  labs(y="Binned variable min max normalized")

BinDF%>%
  ggplot()+
  geom_point(aes(x=BinnedCases, y=BinnedN1))

cor(BinDF$BinnedN1, BinDF$BinnedCases, use="pairwise.complete.obs")
## [1] 0.8941983
summary(lm(BinnedCases~BinnedN1, data=BinDF))
## 
## Call:
## lm(formula = BinnedCases ~ BinnedN1, data = BinDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.903 -32.241  -5.098   5.512  87.714 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.329e+01  1.888e+01  -1.234    0.236    
## BinnedN1     8.530e-04  1.103e-04   7.736  1.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.93 on 15 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.7996, Adjusted R-squared:  0.7862 
## F-statistic: 59.85 on 1 and 15 DF,  p-value: 1.298e-06

To generate this relationship without reducing the amount of data we rely on a Loess smoothing of the data. The first step is to get a more rigorous measure of the relationship of the data. Here we will use cross correlation. These figures sup port the work done in previous steps.

ccf(SLDSmoothedDF$Cases, SLDSmoothedDF$N1, na.action=na.pass)

ccf(SLDSmoothedDF$SLDCases, SLDSmoothedDF$N1, na.action=na.pass)

ccf(BinDF$BinnedCases, BinDF$BinnedN1, na.action=na.pass)

The loess smoothing is a way of generating smooth curves from noisy data. The displayed plot shows the visual power of this smoothing. We see a relationship in the big patterns but also multiple sub patterns match. We see in general that smoothed N1 both lags and leads the case data.

SLDSmoothedDF$loessN1 <- loessFit(y=(SLDSmoothedDF$N1), 
                      x=SLDSmoothedDF$Date, 
                      span=.2, 
                      iterations=3)$fitted


SLDSmoothedDF%>%
  NoNa("loessN1","SLDCases")%>%
  ggplot()+
  geom_line(aes(x=Date, y=MinMaxNorm(loessN1), color="loessN1"))+
  geom_line(aes(x=Date, y=MinMaxNorm(SLDCases), color="SLDCases"))+
  facet_wrap(~Site)+
  labs(y="variable min max normalized")
## [1] "loessN1"  "SLDCases"

ccf(SLDSmoothedDF$loessN1, SLDSmoothedDF$SLDCases, na.action=na.pass)

input <- initialize data
          loess
          SLD smooth
          
Analysis <- plot
Analysis <- CCF